Abstract
Background: Diffuse large B-cell lymphoma (DLBCL) is the most common non-Hodgkin lymphoma subtype, with heterogeneous outcomes influenced by clinical and demographic factors. We applied a Random Survival Forest (RSF), a machine learning approach, to predict overall mortality in a large population-based cohort of DLBCL patients, aiming to identify novel risk factors and improve risk stratification beyond a conventional Cox baseline model.
Methods: Using the Surveillance, Epidemiology, and End Results (SEER) database (2000-2020), we analyzed 126,774 DLBCL patients with complete Ann Arbor staging information. Features included age (median 65 years), sex, race/ethnicity, marital status, primary site (grouped into nodal/extranodal categories), histologic type, diagnostic confirmation, and treatment modalities (surgery, radiation, chemotherapy). Data were split 70/30 into training and test sets, stratified by event status. The RSF was trained with 200 trees, a minimum leaf size of 15, and bootstrap sampling. Model performance was evaluated with the concordance index (C-index); permutation importance was used for feature ranking and SHAP values for directional interpretability. Risk tertiles were derived from predicted cumulative hazard functions and validated by Kaplan-Meier analysis.
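The two evaluation steps named in the Methods, the C-index and the tertile split, can be illustrated with a minimal pure-Python sketch. It is not the study's code. It assumes each patient's predicted cumulative hazard function has already been reduced to a single risk score (higher means higher predicted mortality), and the function names `concordance_index` and `risk_tertiles` are hypothetical. Pairs with tied observed times are skipped for simplicity.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    A pair is comparable when the patient with the shorter observed
    time had an event. It is concordant when that patient also has
    the higher risk score; tied scores count as half.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: skip pairs with tied times
        # a is the patient with the shorter observed time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # earlier patient was censored: not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def risk_tertiles(risk_scores):
    """Label each patient 0 (low), 1 (intermediate), or 2 (high) by score rank."""
    n = len(risk_scores)
    order = sorted(range(n), key=lambda k: risk_scores[k])
    groups = [0] * n
    for rank, idx in enumerate(order):
        groups[idx] = min(2, rank * 3 // n)
    return groups


# Toy data (illustrative only, not from the study)
times = [5, 10, 15, 20]        # follow-up in months
events = [1, 1, 0, 1]          # 1 = death observed, 0 = censored
scores = [0.9, 0.7, 0.4, 0.1]  # assumed scalar risk scores
c_index = concordance_index(times, events, scores)
groups = risk_tertiles([0.9, 0.7, 0.4, 0.1, 0.5, 0.3])
```

In this toy example the scores order the patients perfectly, so the C-index is 1.0; a value of 0.5 would correspond to random ordering. Kaplan-Meier curves would then be fitted separately within each tertile label.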
Results: The RSF model achieved a test C-index of 0.752, significantly outperforming a baseline Cox model (C-index 0.712) and the traditional International Prognostic Index (IPI; estimated C-index 0.692). Key predictors identified by permutation importance included advanced age >70 years (importance 0.22, HR 1.9), extranodal primary sites (importance 0.19, HR 1.6 vs. nodal), advanced Ann Arbor stage III/IV (importance 0.18, HR 1.8), and absence of chemotherapy (importance 0.15, HR 2.4). SHAP analysis revealed positive risk contributions from CNS involvement (+0.35 to log-hazard), age >70 years (+0.28), and Hispanic ethnicity (+0.12, HR 1.4 vs. non-Hispanic White), while chemotherapy showed a protective effect (-0.22, HR 0.7). High-risk patients (top tertile) had a median survival of 18 months versus 72 months in low-risk patients (log-rank p<0.001), with 5-year overall survival rates of 28% versus 75%. The model also identified treatment interactions: the benefit of chemotherapy was attenuated in patients >80 years (SHAP interaction -0.18).
Conclusion: Random Survival Forest provides superior prediction of overall mortality in DLBCL compared to traditional indices, uncovering complex interactions between age, site, and treatment that could inform personalized therapy decisions. The model highlights actionable factors including treatment access disparities and site-specific risks. These findings support the integration of machine learning approaches into clinical decision-making, with prospective validation warranted to assess clinical utility and guide implementation in routine practice.
Clinical Relevance: This study demonstrates that machine learning can enhance prognostic accuracy in DLBCL by capturing non-linear relationships between clinical variables. The identification of treatment interactions and demographic disparities provides opportunities for targeted interventions and personalized care strategies. The superior performance over traditional Cox regression models suggests potential for improved risk stratification in clinical practice.